Digital systolic array architecture and method for computing the discrete fourier transform

ABSTRACT

The present invention provides a more computationally efficient and scalable systolic architecture for computing the discrete Fourier transform. The systolic architecture also provides a method for reducing the array area by limiting the number of complex multipliers. In one preferred embodiment, the design improvement is achieved by taking advantage of a more efficient computation scheme based on symmetries in the Fourier transform coefficient matrix and the radix-4 butterfly. The resulting design provides an array comprised of a plurality of smaller base-4 matrices that can simply be added or removed to provide scalability of the design for applications involving different transform lengths to be calculated. In this embodiment, the systolic array size provides greater flexibility because it can be applied for use with any transform length which is an integer multiple of sixteen.

[0001] This application claims the benefit of U.S. Provisional PatentApplication No. 60/380,494 filed May 14, 2002.

BACKGROUND OF THE INVENTION

[0002] (1) Field of the Invention

[0003] The present invention relates generally to systems and methodsfor computing a discrete Fourier transform and, more particularly, tosystems and methods for designing and providing systolic arrayarchitectures operable for computing the discrete Fourier transform orfast Fourier transform.

[0004] (2) Description of the Prior Art

[0005] Computational schemes for computing the discreet Fouriertransform (DFT) have been intensively studied because of their centralimportance to many digital signal processing applications. A generalpurpose computer can be programmed to provide this computation but dueto the architectural limitations thereof does not operate quickly enoughfor many applications. Specialized processing circuits are thereforefrequently used for this purpose. Such specialized processing circuitsof the prior art are primarily based on architectural computingstructures for computing the DFT using the fast Fourier transform (FFT)method. These “pipelined” architectures, in which data flows through achain of functional arithmetic/memory units, contain many differenttypes of memory and logic components, require computing interconnectionsthat usually span the entire circuit, and require formating of input andout data. Thus, they are irregular and difficult to translate into VLSIhardware. In addition they are only able to compute DFTs with limitedchoices in the number of transform points.

[0006] For the most part, the prior art for determining a DFT is notable to utilize or fully benefit from systolic processing structuresthat permit simultaneous operation of multiple processors to find theDFT by operating on smaller parts of the problem in parallel. Thesystolic model, introduced in 1978 by Kung and Leiserson, has proveditself to be a powerful tool for construction of integrated, specialpurpose processors. However, specialized prior art systolic arrays forcomputing the DFT, either directly or via FFT techniques requireexcessive arithmetic hardware, especially for large transform sizes, andare slow both in terms of computational latency (time to do a single DFTcomputation) and computational throughput (time between successive DFTcomputations).

[0007] A systolic architecture as used herein refers to an array ofsystolic cells, such as simple processors, which can be implemented asintegrated circuits to provide efficient, parallel use of a large numberof processors to thereby perform multiple simultaneous computations. Thedata passes from processing element to processing element. Once data hasbeen read from external memory into the array, several computations canbe performed on the same datum. The mode of computation isarchitecturally different from that which performs a memory access eachtime a data item is used by one or relatively few processors—the latterbeing a characteristic of the von Neumann architecture. Thus, thesystolic array has a major advantage over traditional architectureswhich are limited by the ‘von Neumann bottleneck.’

[0008] The processing elements in systolic arrays operate synchronouslywhereby each cell receives data from its neighboring cells, performs asimple calculation, and then transmits the results (only to itsneighbors) one cycle later. Only those cells that are at the edge of thenetwork communicate with the external world, e.g., with I/O circuitry.Thus, the processing elements connect only to the closest neighboringprocessing elements which may be typically be located north-south,east-west, northeast-southwest, and/or northwest-southeast. In asystolic array, local interconnections from each processing element areeach of a length approximately equal to the size of each processingelement. In other words, the interconnections are short compared to thedimensions of the integrated circuit. The simplified regular localinterconnections of systolic arrays are relatively easier to implementin silicon wafers than complex interconnections found in many circuits.Presently available field programmable gate array (FPGA) based hardwareprovides enormous flexibility in making design choices.

[0009] Prior art non-systolic arrays for computing DFTs or FFTs haveutilized an irregular design structure that is not scalable. Because thestructure of the prior art circuits is irregular, it is not practical toutilize the same prior designs, only with increased size, to produce anew circuit that would compute a DFT with a different transform length,i.e., the design is not scalable. Therefore, design costs for aspecialized application of a non-systolic array remain high because anentirely new design is generally required any time the transform lengthis changed. Moreover, prior art non-systolic array designs for computingDFTs are severely limited in application because the transform lengthwhich they can calculate has to satisfy the restriction N=r^(m) where Nis the transform length, r is the radix, and m is an integer. Thisexponential restriction significantly limits the practical transformlengths that can be computed, especially for large transform sizes. Forexample, if r=2 transform lengths between N=4096 (m=12) and N=8192(m=13) would not be directly computable. Many prior art non-systolicdesigns use r=4 because this permits significant arithmeticsimplification, but practical transform lengths are even more severelylimited.

[0010] Systolic array designs have been more costly to implement in theprior art for computing DFTs at least in part because they require atleast the same number of fixed point complex multipliers as thetransform length to be computed. Complex fixed point multiplierprocessing elements are more costly than complex fixed point addersbecause they consume considerably more chip area and power. Fixed pointmeans that each transform point is represented by a specific number ofbinary digits. Thus a word of 16 bits can represent numbers up to 2¹⁶.Floating point numbers allow a lot more flexibility because they use aset of exponent bits to provide a much wider range of values that aspecific number of bits can represent. But working with two floatingpoint numbers adds considerable complexity, because the exponents haveto be compared and “aligned”, which involves a shift of one of thewords. Consequently, floating point processing elements haveconsiderable additional complexity that makes them less desirable for inmultiple processor applications. The efficiency of such designs in termsof the area of the circuit (which is mainly determined by the number ofmultipliers) and the time to do successive DFTs, is relatively low.Moreover, the processing elements in prior art systolic array designsoften have to be capable of performing both adding and multiplyingfunctions thereby further increasing the cost and circuit area ofimplementation.

[0011] The following patents show attempts to solve the above or relatedproblems:

[0012] U.S. Pat. No. 4,777,614, issued Oct. 11, 1988, to J. S. Ward,discloses a digital data processor for matrix-vector multiplication, andcomprises a systolic array of bit level, synchronously clock activatedprocessing cells each connected to its row and column neighbors. On eachclock cycle, each cell multiplies an input bit of a respective vectorcoefficient by a respective matrix coefficient equal to +1, −1 or 0, andadds it to cumulative sum and carry input bits. Input vector coefficientbits pass along respective array rows through one cell per clock cycle,Contributions to matrix-vector product bits are accumulated in arraycolumns. Input to and output from the array is bit-serial, wordparallel, least significant bit leading, and temporally skewed.Transforms such as the discrete Fourier transform may be implemented bya two-channel device, in which each channel contains two processors ofthe invention with an intervening bit serial multiplier. Processors ofthe invention may be replicated to implement multiplication by largermatrices. This type of systolic array requires N complex multipliers andN complex adders, takes at least N time-steps to compute successiveDFTs, and has a latency of 2N−1 to do a single DFT. These numbers arefar higher than for the systolic method described hereinafter.

[0013] U.S. Pat. No. 6,098,088, issued Aug. 1, 2000, to He et al,discloses a real-time pipeline processor, which is particularly suitedfor VLSI implementation, is based on a hardware oriented radix-2²algorithm derived by integrating a twiddle factor decompositiontechnique in a divide and conquer approach. The radix-2² algorithm hasthe same multiplicative complexity as a radix-4 algorithm, but retainsthe butterfly structure of a radix-2 algorithm. A single-pathdelay-feedback architecture is used in order to exploit the spatialregularity in the signal flow graph of the algorithm. For a length-N DFTtransform, the hardware requirements of the processor proposed by thepresent invention is minimal on both dominant components: Log4N−1complex multipliers, and N−1 complex data memory.

[0014] U.S. Pat. No. 6,023,742, issued Feb. 8, 2000, to Ebeling et al.discloses a configurable computing architecture with its functionalitycontrolled by a combination of static and dynamic control, wherein theconfiguration is referred to as static control and instructions arereferred to as dynamic control. A reconfigurable data path has aplurality of elements including functional units, registers, andmemories whose interconnection and functionality is determined by acombination of static and dynamic control. These elements are connectedtogether, using the static configuration, into a pipelined data paththat performs a computation of interest. The dynamic control signals aresuitably used to change the operation of a functional unit and therouting of signals between functional units. The static control signalsare provided each by a static memory cell that is written by a host. Thecontroller generates control instructions that are interpreted by acontrol path that computes the dynamic control signals. The control pathis configured statically for a given application to perform theappropriate interpretation of the instructions generated by thecontroller. By using a combination of static and dynamic controlinformation, the amount of dynamic control used to achieve flexibleoperation is significantly reduced.

[0015] U.S. Pat. No. 5,098,801, issued Mar. 3, 1992, to White et al.,discloses a modular, arrayable, FFT processor for performing apreselected N-point FFT algorithms. The processor uses an input memoryto receive and store data from a plurality of signal-input lines, and tostore intermediate butterfly results. At least one Direct FourierTransformation (DFT) element selectively performs R-point direct Fouriertransformations on the stored data according to the FFT algorithm.Arithmetic logic elements connected in series with the DFT stage performrequired phase adjustment multiplications and accumulate complex dataand multiplication products for transformation summations. Accumulatedproducts and summations are transferred to the input memory for storageas intermediate butterfly results, or to an output memory for transferto a plurality of output lines. At least one adjusted twiddle-factorstorage element provides phase adjusting twiddle-factor coefficients forimplementation of the FFT algorithm. The coefficients are preselectedaccording to a desired size for the Fourier transformation and arelative array position of the arrayable FFT processor in an array ofprocessors. The adjusted twiddle-factor coefficients are those requiredto compute all mixed power-of-two, power-of-three, power-of-four, andpower-of-six FFTs up to a predetermined maximum-size FFT point value forthe array which is equal to or greater than N.

[0016] The non-systolic prior art designs discussed above for computingDFTs are irregular, non-scalable, difficult to design, and costly toimplement. Prior art systolic designs are more costly and lessefficient. Prior art systolic designs require the same number, or more,of complex multipliers than the transform length to be performed. Itwould be highly desirable to reduce the number of multipliers by atleast a factor of four as compared to prior art systolic designs. Priorart systolic systolic designs have not been able to take advantage orradix-4 structures or other bases. The prior art systolic designsrequire more complicated processing elements that are required tomultiply and add. It would be desirable to be able to provide a hardwareimplementation of a systolic array whereby the processing elements areonly required to multiply or add, but not both, and the required numberof multipliers is significantly reduced. The base-4 systolic designdisclosed herein exploits the computational efficiency of a radix-4butterfly, yet transform lengths must only be a multiple of sixteen,compared to traditional radix-4 designs which must satisfy N=4^(m),where N is the transform length and m is an integer. Consequently, thoseskilled in the art will appreciate the present invention which providessolutions to the above and other problems.

SUMMARY OF THE INVENTION

[0017] Accordingly, it is an objective of the present invention toprovide an improved system and method for determining a DFT either bydirect methods or via FFT techniques.

[0018] Another objective is to provide a system and method as aforesaidwhich comprises a systolic array architecture of processing elementsorganized with an improved structure.

[0019] A further objective is to provide a system and method asaforesaid whereby the structure of the systolic array can be selected tobe hardware efficient, and/or scalable, and/or have a high throughputand low latency.

[0020] These and other objectives, features, and advantages of thepresent invention will become apparent from the drawings, thedescriptions given herein, and the appended claims. However, it will beunderstood that the above listed objectives and advantages of theinvention are intended only as an aid in understanding aspects of theinvention, and are not intended to limit the invention in any way, anddo not form a comprehensive list of objectives, features, andadvantages.

[0021] Accordingly, the present invention provides a method forbuilding/designing a digital hardware processor architecture forcomputing a discrete Fourier transform. The discrete Fourier transformmay be defined by a first equation wherein one or more matrices Z offrequency domain values are equal to one or matrices X of time domainvalues times a one or more coefficient matrices. The method may compriseone or more steps such as, for example only, providing that the one ormore matrices Z of frequency domain values are equal to a plurality ofmatrices multiplied together wherein at least two of the plurality ofmatrices have matrix products which involve only or primarily complexnumber additions.

[0022] Additional steps may comprise generating one or more systolicarray designs wherein the one or more systolic array designs eachcomprise a plurality of processing elements. Each of the plurality ofprocessing elements may be adjacent to one or more neighboringprocessing elements. Each of the plurality of processing elements may beelectrically interconnected to at least one of the one or moreneighboring processing elements. The method may also comprisedetermining the digital hardware discrete Fourier transform processorfrom the one or more systolic array designs.

[0023] In a preferred embodiment, the one or more systolic array designshave a structure based on an equation of the form

Y=W ^(t) _(M) ·C _(M1) X

Z=C _(M2) Y ^(t)

[0024] wherein Y, Z, W^(t) _(M), C_(M1), C_(M2), and X comprise one ormore matrices, and the matrix products C_(M1)X and C_(M2)Y^(t) involvepredominantly complex number additions. In one embodiment of the digitalprocessor architecture wherein base is equal to four, C_(M1) and C_(M2)consists only of elements defined by the set {i, −i, 1, −1} and thematrix products C_(M1)X and C_(M2)Y^(t) consist only of complexadditions. In another embodiment, of the digital processor architectureC_(M1) and C_(M2) consists only of elements defined by the set{1/sqrt(2)*(i+1), 1/sqrt(2)*(−i+1), 1/sqrt(2)*(i−1), 1/sqrt(2)*(−i−1),i, −i, 1, −1} wherein the base is 8 and the matrix products wouldinvolve a small amount of complex multiplication.

[0025] Additional steps comprise providing that the one or more arraydesigns are compatible for use with any transform length which is aninteger multiple of 16. Additionally, the plurality of processingelements may be organized into regular blocks of base-4 matrices whereinthe one or more array designs are scalable to different transformlengths by varying a number of the regular blocks of base-4 matrices.

[0026] In one embodiment of the invention, the one or more array designscomprises a pipelined linear array which is scaleable so that the numberof the regular blocks of base-4 matrices are added only to one lengthdimension of the pipelined linear array to permit processing of anincreased transform length. By pipelined linear array it is meant thatthe length dimension of the array increases or varies with the transformlength but the width dimension is constant, e.g., an array wherein thelength is N/b and the width is b such that N is the transform length andb is a base such that N is divisible by b². In a preferred embodiment, bis equal to 4 but the invention comprises the use of other bases aswell.

[0027] The plurality of processing elements comprise a total number ofcomplex multiplier processing elements of no more than the transformlength divided by a base value which is equal to four in the preferredembodiment. However, the base value could use other bases such as eight,sixteen, or other bases which are power of two.

[0028] The plurality of processing elements preferably operatesynchronously and each of the plurality of complex addition processingelements are not required to both add and multiply. In one embodiment,the digital hardware discrete Fourier transform processor is operablefor performing a two-dimensional discrete Fourier transform.

[0029] In another embodiment, the invention comprises a digitalprocessor architecture implemented in one or more integrated circuitsoperable for computing a discrete Fourier transform. The discreteFourier transform has a transform length N. The processor comprises oneor more elements such as, for instance, at least one systolic arrayformed from a plurality of processing elements. The plurality ofprocessing elements may further comprise a plurality of complexmultiplier processing elements and a plurality of complex adderprocessing elements. The plurality of complex multiplier processingelements are, in one presently preferred hardware efficient embodiment,limited in number to no more than N/4. The plurality of complex adderprocessing elements are organized into one or more sub arrays of atleast one systolic array such that the one or more sub arrays has afirst dimension four complex adder processing elements long and a seconddimension N/4 complex adder processing elements long.

[0030] The digital processor architecture may further comprise asystolic array operable to perform a discrete 2-D DFT of size n1 by n2.This is done by performing a 1-D DFT on the n1 rows (or n2 columns),followed by another 1-D DFT on the n2 columns (or n1 rows), utilizingthe same systolic array for all computations.

[0031] The digital processor architecture may further comprise asystolic array operable to perform a discrete 1-D Fourier transform oflength N, where N=n1*n2. This done by treating computation as a 2-D n1by n2 DFT as heretofore described.

[0032] In these latter two embodiments, the plurality of complex adderprocessing elements is limited in number to no more than8*max(n1/4,n2/4) and the plurality of complex multiplier processingelements is limited in number to no more than max(n1/4,n2/4).

[0033] In another embodiment, the invention comprises a digitalprocessor architecture for computing a discrete Fourier transform maycomprise at least one systolic array formed from a plurality ofprocessing elements for the at least one systolic array. The pluralityof processing elements may further comprise a plurality of complexmultiplier processing elements and a plurality of complex adderprocessing elements. The systolic array has a structure based on anequation of the form:

Y=W ^(t) _(M) ·C _(M1) X

Z=C _(M2) Y ^(t)

[0034] wherein Y, W^(t) _(M), C_(M1), and C_(M2) comprise one or morematrices and the matrix products C_(M1)X and C_(M2)Y^(t) involve onlycomplex number additions in the preferred embodiment.

[0035] The digital processor architecture may further comprise memorystructures external to the systolic array that contains input values forX and coefficients C_(M2). The plurality of complex multiplierprocessing elements may comprise internal memory structures capable ofstoring data values from matrix W^(t) _(M).

[0036] In one embodiment, the plurality of complex multiplier processingelements may be organized into a one dimensional array that has a lengthof N/base complex processing elements where N is a transform length ofthe discrete Fourier transform and the base value is equal to 4 in thepreferred embodiment. Moreover, the plurality of primarily complex adderprocessing elements may be organized into one or more sub arrays of thesystolic array with a first dimension complex adder processing elementsof width equal to the base and N/base complex adder processing elementslong, where the base value is equal to 4 in the preferred embodiment.

BRIEF DESCRIPTION OF THE DRAWINGS

[0037] A more complete understanding of the invention and many of theattendant advantages thereto will be readily appreciated as the samebecomes better understood by reference to the following detaileddescription when considered in conjunction with the accompanying drawingwherein corresponding reference characters indicate corresponding partsthroughout several views of the drawings and wherein:

[0038]FIG. 1A provides schematics for two minimum area, latency optimal,systolic array designs for computing a transform length N=64 which arein accord with the present invention;

[0039]FIG. 1B is space-time view of an array design with a projection intime from the base;

[0040]FIG. 2A. is a schematic for latency optimal, throughput optimal,1-D DFT systolic designs with variable Z available along the edge of thearray for computing a transform length N=64 which are in accord with thepresent invention;

[0041]FIG. 2B is a is a schematic for latency optimal, throughputoptimal, 1-D DFT systolic designs with variable X available along theedge of the array for computing a transform length N=64 which are inaccord with the present invention.

[0042]FIG. 3 provides two different perspectives of the space-time viewfor two DFT iterations corresponding to the systolic array in FIG. 2B.in accord with the present invention.

[0043]FIG. 4A provides a schematic for a high throughput DFT systolicdesign which provides the same throughput regardless of the transformlength wherein data needs to be formatted for input and output;

[0044]FIG. 4B provides a schematic for a high throughput DFT systolicdesign which provides the same throughput regardless of the transformlength wherein no formatting of X and Y data is necessary becauseformatting is done internally to the array;

[0045]FIG. 5 provides two perspectives of space-time views of two columnand two row DFT iterations performed by the systolic arrays in FIGS. 2Aand 2B, respectively (N=64), wherein the variable Y is a plane and canbe seen in between the IM1 and IM2 variables;

[0046]FIG. 6 is a schematic for a hardware efficient DFT systolic designfor use with a transform length equal to thirty-two;

[0047]FIG. 7 is a schematic for complex multiplier processing elementfor use in the circuit of FIG. 6;

[0048]FIG. 8 is a schematic for a complex adder processing element foruse in an adder array for the circuit of FIG. 6;

[0049]FIG. 9 is a schematic for a different complex adder processingelement from that of FIG. 8 for use in an adder array for use in thecircuit of FIG. 6;

[0050]FIG. 10 is a schematic for control element for use in the circuitof FIG. 6.

BRIEF DESCRIPTION OF THE PREFERRED EMBODIMENTS

[0051] The DFT is defined as $\begin{matrix}\begin{matrix}{{Z\lbrack k\rbrack} = {\sum\limits_{n = 1}^{N}{{X\lbrack n\rbrack}^{{- {j{({2\quad {\pi/N}})}}}{({k - 1})}{({n - 1})}}}}} & \quad & \quad & {{k = 1},{2\quad \ldots \quad N}}\end{matrix} & (1)\end{matrix}$

[0052] where X[n] are the time domain input values and Z[k] are thefrequency domain outputs. It is convenient to represent (1) in thematrix terms

Z=CX  (2)

[0053] where C is a coefficient matrix containing elements W_(N)^(kn)=e^(−2*j*π*(n−1)*(k−1)/N). In order to transform C into the desiredbase-b format it is necessary to find a permutation matrix P thatreorders X and Z according to $\begin{matrix}{{X_{b = 4} = {{P\begin{bmatrix}X_{1} \\X_{2} \\X_{3} \\X_{4} \\X_{5} \\\vdots \\X_{13} \\X_{14} \\X_{15} \\X_{N}\end{bmatrix}} = \begin{bmatrix}X_{1} \\X_{1 + {N/4}} \\X_{1 + {N/2}} \\X_{1 + {3{N/4}}} \\X_{2} \\\vdots \\X_{N/4} \\X_{N/2} \\X_{3{N/4}} \\X_{N}\end{bmatrix}}},{{{and}\quad Z_{b = 4}} = {P\quad {Z\quad.}}}} & (3)\end{matrix}$

[0054] With this value of P, C can be transformed into C_(b)=PCP^(t), sothat Z_(b)=C_(b)X_(b). This transformation allows C_(b) to be written asan (N/b)×(N/b) array of b×b blocks, each block C[i,j] specified by

C _(M) [i,j]=W _(M) [i,j]*c _(D((j−1)mod(b))+1) C_(((i−1)mod(b))+1)  (4)

[0055] where c_(Di)=c_(i) ^(t)I with c_(i) a b-element vector, C_(i) isa b×b matrix, each row being c_(i) ^(t), and W_(M)[i, j] is an elementin the (N/b)×(N/b) matrix, $\begin{matrix}{W_{M} = {\begin{bmatrix}1 & 1 & 1 & 1 & \cdots \\1 & W^{1} & W^{2} & W^{3} & \cdots \\1 & W^{2} & W^{4} & W^{6} & \cdots \\1 & W^{3} & W^{6} & W^{9} & \cdots \\\vdots & \vdots & \vdots & \vdots & ⋰\end{bmatrix}\quad.}} & (5)\end{matrix}$

[0056] With this block reformulation it is possible to factor (2) into

Y=W _(M) ·C _(M1) X

Z=C _(M2) Y ^(t)  (6)

[0057] where “·” in (6) corresponds to an element by element multiply.In (6) C_(M1) and C_(M2) contain N/b submatrices C_(B)=[c₁|c₂| . . .c_(b)]^(t) with the form C_(M1) = [C_(B)^(t)|C_(B)^(t)|…]^(t)

[0058] and C_(M2)=[C_(B)|C_(B)| . . . ], and Z,X have been redefined forb=4 as follows: $\begin{matrix}{{Z = \begin{bmatrix}Z_{1} & \quad & Z_{N/4} \\Z_{1 + {N/4}} & \ldots & Z_{N/2} \\Z_{1 + {N/2}} & \quad & Z_{3{N/4}} \\Z_{1 + {3{N/4}}} & \quad & Z_{N}\end{bmatrix}},{X = {\begin{bmatrix}X_{1} & \quad & X_{N/4} \\X_{1 + {N/4}} & \ldots & X_{N/2} \\X_{1 + {N/2}} & \quad & X_{3{N/4}} \\X_{1 + {3{N/4}}} & \quad & X_{N}\end{bmatrix}\quad.}}} & (7)\end{matrix}$

[0059] For base-4 designs (b=4), ${c_{1} = \begin{bmatrix}1 \\1 \\1 \\1\end{bmatrix}},{c_{2} = \begin{bmatrix}1 \\{- j} \\{- 1} \\j\end{bmatrix}},{c_{3} = \begin{bmatrix}1 \\{- 1} \\1 \\{- 1}\end{bmatrix}},{c_{4} = \begin{bmatrix}1 \\j \\{- 1} \\{- j}\end{bmatrix}}$

[0060] and C_(B) describes a radix-4 decimation in time butterfly.

[0061] By comparing (6) with (2), the computational advantages of themanipulation leading to (6) for base-4 designs are readily evident. In(6) the matrix products C_(M1)X and C_(M2)Y^(t) involve only additionsbecause the elements of C_(M1) and C_(M2) contain only ±1 or ±j, whereasthe product CX in (2) requires complex multiplications. Also, the sizeof the coefficient matrix W_(M) in (6) is (N/b)×(N/b) vs. the N×N sizeof C in (2); consequently the number of overall direct multiplicationsin (6) is reduced compared to (2). Note that for systolicimplementations a distribution of the elements in C_(M1) and C_(M2) doesnot impose significant bandwidth requirements because full complexnumbers are not used.

[0062] In a presently preferred embodiment of the invention, equation(6) is utilized within a systolic CAD tool which is capable of allowingthe user to specify an algorithm with traditional high-level code, setsome architectural constraints and then view the results in a meaningfulgraphical. A preferred systolic CAD tool may be found athttp://www.centar.com under the name SPADE. SPADE takes as input a codedform of a system of non-uniform affine recurrence equation

w ₁(A ₁(I))=g( . . . w _(i)(B _(i1)(I)), . . . ) for all I in I ₁

w _(n)(A _(n)(I))=f( . . . w _(i)(B _(in)(I)), . . . ) for all I inI_(n)  (8)

[0063] where f,g represent the functional variable dependencies, I_(j)is the index range for equation j, w_(i) is one of the algorithmvariables and the affine indexing functions are

A(I)=AI+a; B(I)=BI+b.  (9)

[0064] Here, A/a, and B/b are integer matrices/vectors. For eachalgorithm variable x SPADE finds an affine transformation, T, such that

T(x)=T _(x)(A _(x) I+a _(x))+t _(x),  (10)

[0065] where T_(x) is a matrix and t_(x) is a vector. Thus, T_(x) andt_(x) represent affine “mappings” from one index set to another indexset, where the latter index set includes one index representing timewith the remaining indices representing spatial coordinates. Everyvariable and its associated calculation receive a unique mapping to“space-time.” The projection of the space-time structure along the timeaxis produces a systolic array after an elemental processing element(PE) is associated with each grid coordinate. In addition to the outputsT, t for each of the algorithm variables, SPADE computes a set ofvectors indicating the direction of data flow between PEs for eachdependency in the algorithm. SPADE uses an exhaustive search processthat examines all combinations of possible transformations to finddesigns that result in the minimum computation time, defined as themaximum difference between temporal extrema in space-time (latencyoptimal). To facilitate identification of suitable array designs, SPADEallows the designer to eliminate from the search space designs that failcertain constraint tests that define architectural features. Forexample, if it is desired that a variable appear only at the PE arrayedge, then the position points in space-time of that variable'selements, e.g., X[i,j] in (6), must be such that the unit time vectoru_(t) is in the plane of these position points (hence X's projectiononto the PE array is a line). This “time align” constraint requires thatthe normal n_(a) to the plane of X satisfies u_(t)·n_(a)=0. (If it isdesired that a variable not project to a line, then a constraint can beset that requires u_(t)·n_(a)≠0.) Also, for the projection of X to be atthe edge of the PE array it must constrained to not be within the convexhull of the polygon defining the PE array. Constraints like these areuseful because they provide the designer with only the solutions inwhich he is interested, while considerably limiting the space ofpossible systolic designs that SPADE has to explore.

[0066] Various possible systolic designs may be found by SPADE forcalculating the DFT from (6). In this example, the goal was to findarchitectures which provides the highest throughput, but uses theminimum array area. Even though SPADE uses a minimum latency objectivefunction, the designs it finds are often the most regular and thereforemost suitable for providing high throughput. SPADE also providesconsiderable flexibility in examining non-optimal solutions as well.

[0067] To start the design process it is necessary to convert (6) intothe coded form

[0068] for j to N/b do

[0069] for k to N/b do

Y[j,k]:=WM[j,k]*add(CM1[j,i]*X[i,k],i=1 . . . b);  (11)

[0070] od;

[0071] for k to b do

Z[k,j]:=add(CM2[k,i]*Y[j,i],i=1 . . . N/b);

[0072] od;

[0073] od;

[0074] where b is the base (b=4 in the preferred embodiment) and N isthe DFT size. This code follows directly from (6) because the “add”function is interpreted by the SPADE parser as a summation over theindex i. SPADE accepts (11), which is a subset of the Maple™ language,directly as input. It can be seen from the loop limits in (13) that itis necessary for the transform length to be divisible by 4.

[0075] The analysis process starts by setting all the desiredarchitectural constraints, some of which are discussed herein, and thenrelaxing these successively in order of the designer's criteria ofdecreasing importance, if no satisfactory designs are found. Desiredarchitectural constraints are found by examining (11) and (6) inconjunction with typical desired characteristics of systolic arrays.Typically, some experimentation is required to find the best tradeoffbetween throughput and area.

[0076] In this example, it was chosen to select as the most importantconstraint that the number of complex multipliers be minimized. However,as noted above, the designer is free to select any constraint, forinstance maximum speed and uniformity of design. As noted earlier, theproducts C_(M1)X and C_(M2)Y^(t) only require complex additions.However, from (11) it can be seen that there has to be a complexmultiplication associated with every element Y[j,k] of Y. Themultiplications take place at the space-time locations of the elementsof Y. Therefore it is desirable in this particular embodiment that Y betime aligned so that it projects onto the PE array as a linear array ofmultipliers rather than a 2-D array of multipliers. Variables that aretime aligned require O(N) memory registers per PE because a 2-D variableis projected onto a linear array of O(N) PEs. Next, it would also bedesirable to avoid a network flow of complex data from W^(M) to Y,because of the bandwidth required to support this. To avoid thispossibility, the designer may selectively choose to set SPADE toautomatically use the same transformation matrices for both W_(M) and Y,so that both variables are mapped to the same points in space-time.Finally, for this embodiment of the invention, SPADE constraints wereset so that array I/O (X and Z), occurs at array boundaries, whilecoefficient matrices, C_(M1),C_(M2), can be constrained to reside in anon-time aligned fashion internal to the PE array. The advantage of thisconstraint is that edge-based memory structures do not have to becreated to distribute array I/O to the systolic array.

[0077] With all the constraints set as described above, SPADE producedseveral systolic array designs. However, all of the designs found hadrelatively high latencies, low throughputs and consumed substantialareas. Experimentation revealed that the designs with the lowestlatencies, smallest areas and good throughputs required one of either Xor Z to be resident internally in the array at the end of a singlecomputation. SPADE found six such latency optimal designs(L=latency=N/2+8) with a “bounding box” area of 5(N14+5). As a measureof throughput, the number of cycles between successive transformcomputations is used and denoted by T⁻¹ to distinguish it from thetransformation matrix T. Each of these designs had the same throughput,T⁻¹=N/4+6. For two of the designs Z/C_(M1) were time aligned and for theother four X/C_(M2) were time aligned. An example of a SPADE generatedoutput from each of these categories is shown in FIG. 1A wherein twoarray designs have been generated. The labels Y, X, Z, CM1, and CM2shown in FIGS. 1A and 1B indicate the regions of activity associatedwith the respective variables Y, X, Z, CM1, and CM2. A space-time viewof one of these designs is shown in FIG. 1B, and indicates that thespace-time mapping is somewhat irregular. The “IM” regions shown in FIG.1B perform the two running sums associated with the “add” functions inequation (12). The space-time view of FIG. 1B provides an indication ofactivity within any particular region with respect to time.

[0078] The computational efficiency inside the polygons and polytopes inFIG. 1B is 100%, i.e., there is a computation at each internalspace-time point each cycle for each variable at which a computationoccurs. High computational efficiencies are necessary to achieve thebest latencies. However, to achieve the maximum throughput, it isequally important to “fill” the entire space-time volume withcomputations. In order for this to be possible the spatial structure ofthe computation, as illustrated in FIG. 1B, must be replicable in timein such a manner that this is geometrically possible. It can been thatthe space-time structure in FIG. 1B contains irregularities which do notpermit a good “fit” between adjacent computational iterations in time.

[0079] In order to achieve better throughputs, SPADE was again used togenerate solutions with sub-optimal areas, but more regular space-timestructures. This does not represent a strong tradeoff because the PEarray “areas” are not uniform in FIG. 1A. That is, most of the PEsperform a complex addition, whereas the linear array for PEs associatedwith the variable Y do complex multiplications. By focusing on systolicarrays with larger areas, two latency optimal (L=N/2+8) and throughputoptimal (T⁻¹=N/4+1) solutions were found (FIG. 2A and FIG. 2B). Againeach of these requires one of either X or Z to be time aligned. That thespace-time computation region is efficiently used is demonstrated inFIG. 3 for two successive iterations of the DFT represented by (6). Timeincreases to right. However, the design in FIG. 3 requires an unusualplacement of a linear array of PEs with O(N) memory per PE at the arraycenter.

[0080] Each of the design types shown in FIG. 1 and FIG. 3 require thatX or Z be pre-loaded into or unloaded from the systolic array. Inprincipal this is a disadvantage because extra bandwidth must beprovided so that this operation can be “buried” in the computationcycle. However, there is an inherent advantage for the positioning of Xand Z in this case because the load/unload would be such that data issequenced in normal order during this period. This is not a smallconsideration because input/output formatters can consume considerableamount of space. In order to fully characterize the systolic arraydesigns in FIG. 2 it is necessary to specify the transformation matricesfor variables in (11). These are provided in Tables 1 and 2. Table 1provides the T/t TABLE 1 Transformation matrices for variables shown inFIG. 2(a) var Y IM var1 CM1 X Z IM var2 CM2 T $\begin{bmatrix}1 & 1 \\0 & 0 \\{- 1} & 0\end{bmatrix}\quad$

$\begin{bmatrix}1 & 1 & 1 \\{- 1} & 0 & 0 \\0 & {- 1} & 0\end{bmatrix}\quad$

$\begin{bmatrix}1 & 1 \\0 & {- 1} \\{- 1} & 0\end{bmatrix}\quad$

$\begin{bmatrix}{- 1} & 1 \\1 & 0 \\0 & {- 1}\end{bmatrix}\quad$

$\begin{bmatrix}1 & 1 & {- 1} \\0 & 0 & 1 \\0 & {- 1} & 0\end{bmatrix}\quad$

$\begin{bmatrix}{- 1} & 1 \\1 & 0 \\0 & 0\end{bmatrix}\quad$

t [5, 0, 0] [0, 5, 0] [0, 5, 0] [0, 5, 0] [27, −5, 0] [10, −5, 0] [10,−5, 0]

[0081] matrices for the variables and Table 2 provides the correspondingvectors ([time,x,y]) showing the data flow direction for each dependencyin (11). Note that the data flow for two dependencies indicates nospatial movement. This means that the variable remains in the PE towhich it was originally assigned and gets reused in that same PE eachcycle. TABLE 2 Dependencies (top) and corresponding propagation vectors(below) for design in FIG. 2(a). Y −> IM_var1 IM_var1 −> CM1 IM_var1 −>X Z −> IM_var2 IM_var2 −> CM2 IM_var2 −> Y [1, −1, 0] [1, 0, 0] [1, 0,−1] [1, 0, 0] [1, 0, −1] [1, −1, 0]

[0082] It is useful to briefly compare the present design with prior artdesigns. A variety of systolic array designs have been proposed forcomputation of the DFT. These are of two basic types, those that havesimilarities to systolic arrays for matrix-vector multiplication (Type Ain Table 3) and those based on factoring a 1-D DFT into two sets of sizen₁ and n₂ such that N=n₁n₂ (Type B in Table 3). Those array designs inthe Type A category have T⁻¹=N and L=2N−1. The factorization approach ofType B achieves a significant speedup because it can be done as a 2-DDFT while avoiding the need for a matrix transposition. This results inT⁻¹=2{square root}{square root over (N)}+1 and L=4{square root}{squareroot over (N)} when n₁=n₂={square root}{square root over (N)}. However,all the designs above use a number of complex multipliers and complexadders equal to the transform length.

[0083] One of the main advantages of this example of the base-4 approachof the above discussed embodiment is that it achieves good throughputswith reduced hardware complexity as illustrated in Table 3.Specifically, it reduces the number of complex multipliers by a multipleof 4, while the number of complex adders increases only by a multiple of2 for the same value of N compared to both the previous methods. Themain disadvantage of the circuit of this example is that the throughputis proportional to N vs. {square root}{square root over (N)}⁻¹ for thesecond approach, so that at some value of N it will be less efficient.In order to estimate a “crossover point” at which this occurs, one canuse a complexity measure to determine this. For example, if it isassumed that the complexity measure is “area-time”, the area of an adderis ¼ that of a multiplier and T⁻¹ is the measure for time, then thecrossover point is reached at N≈256. Consequently, the arrays describedin this example work best for short transforms.

[0084] Both the proposed designs in FIG. 2 and the factorizationapproach have similar design complexities in that they both require someformatting of incoming or outgoing data, they have the same number ofI/O ports, and similar overall bandwidth requirements and arithmeticcycles. TABLE 3 Comparison of different systolic DFT architectures TypeA Type B Example of FIG. 2 N M A T⁻¹ L M A T⁻¹ L M A T⁻¹ L 16 16 16 8 3116 16 9 16 4 32 5 16 64 64 64 64 127 64 64 17 32 16 128 17 40 256 256256 256 511 256 256 32 64 64 512 65 136

[0085] However, if the design of FIG. 3 were collapsed into a lineararray, the horizontal buses and PEs could potentially be replaced by anarray of registers properly interfaced to an array of adders, whichmight be a very favorable tradeoff. Use could also be made of the factthat the coefficient matrices transmit only ±1 and ±j, which requireslittle bandwidth and consequently only one high bandwidth bus might berequired for a linear array.

[0086] While the above-described embodiment has been directed atachieving cost effective designs with maximum throughputs, the presentinvention is not intended to be limited to this type of design. Forinstance, it is also possible to search for designs that achieve maximumthroughput irrespective of hardware costs. Here, two such designs areidentified as indicated in FIG. 4A and FIG. 4B. The primary constraintplaced on all previous SPADE design searches was that the variable Y betime aligned so that only a linear array of multipliers would beproduced. However, if this constraint is relaxed, designs can be foundthat provide significantly improved throughputs, two embodiments ofwhich are shown in FIGS. 4A and 4B. Each of these has T⁻¹=9,irrespective of the transform size. The design in FIG. 4A requires datato be formatted for input and output. The design of FIG. 4B requires noformatting of X and Z since the correct formatting is done internally tothe array itself (“on-the-fly”).

[0087] The previous sections discussed systolic designs for b=4;however, from (6) it's clear that there are benefits to using higherbases in that the size of W_(M) decreases leading to fewer complexmultiplications and higher throughputs. We have found that for base-bsystolic arrays (FIG. 3)

L=2N/b+2b

T ⁻¹ =N/b+1

[0088] which does point to benefits in using a higher base. The primarydrawback to using higher bases is that the coefficient matrices C_(B) nolonger contain the simple complex numbers {1, −1, j, −j}, because thecoefficients are taken from b points equally spaced on the unit circle.Base-8 designs still contain many simple complex numbers in C_(B),leading to a potential significant reduction in hardware, but it is notclear that the overall performance is better than that of the base-4design. Additionally, it suffers from less regularity in its design.

[0089] The present invention may also be utilized to produce 2-D N×N DFTdesigns based on the same algorithm formulation (6) as that for the 1-Dtransform. An attractive strategy from an algorithmic point of viewwould be the use of an N×N arrays of PEs that can perform the columntransforms followed by row transforms without the need for a matrixtransposition in between as is found in the prior art. However, this 2-Darray of multiplier/adders imposes significant hardware penalty.Alternatively, we propose here to build on the less hardware intensivelinear designs in FIG. 2 by noting that the output Z of the design inFIG. 2A could be used directly as the input X to the design of FIG. 2B.That is the 1-D column transforms can be done with the array of FIG. 2A,followed by the row transforms using the same array configuration ofFIG. 2B. However, for this approach to work directly, it would benecessary to transpose the data in between the row and column DFTs, avery time intensive extra operation.

[0090] Alternatively, it is possible to do the transposition“on-the-fly” during the course of the column DFTs performed in FIG. 2a.This can be done by systolic rotation of the elements of the matricesCM1, CM2 and W_(M) as the column DFTs are performed. To illustrate thisconcept the code in (113) can be altered to do N successive column DFTsincluding these rotations by writing it as:

[0091] for n to N do for j to N/4 do (14) for k to N/4 do Y[j,k] :=WM[j,k]*add(CM1[j,i]*X[i,k],i=1..b) od; for k to b do Z[k,j] :=add(CM2[k,i]*Y[j,i],i=1..N/4) od; WM := matrix_rotate(WM,“down”); CM1 :=matrix_rotate(CM1, “down”); if n mod(b)=0 then CM2 :=matrix_rotate(CM2,“down”) fi; od; od; od;

[0092] where the procedure “matrix_rotate( )” does a circular rotationof the argument matrices in the direction indicated. Here the rotationsof WM and CM1 cause a rotation of the Z output columns one position tothe right, and the rotation of CM2 causes Z to be rotated downward byone row. The Z outputs from the code (14) for the first two column DFTsof a 64×64 2-D DFT obtained from this code would then be$\begin{bmatrix}z_{1} & z_{2} & z_{3} & z_{4} & z_{5} & z_{6} & z_{7} & z_{8} & z_{9} & z_{10} & z_{11} & z_{12} & z_{13} & z_{14} & z_{15} & z_{16} \\z_{17} & z_{18} & z_{19} & z_{20} & z_{21} & z_{22} & z_{23} & z_{24} & z_{25} & z_{26} & z_{27} & z_{28} & z_{29} & z_{30} & z_{31} & z_{32} \\z_{33} & z_{34} & z_{35} & z_{36} & z_{37} & z_{38} & z_{39} & z_{40} & z_{41} & z_{42} & z_{43} & z_{44} & z_{45} & z_{46} & z_{47} & z_{48} \\z_{49} & z_{50} & z_{51} & z_{52} & z_{53} & z_{54} & z_{55} & z_{56} & z_{57} & z_{58} & z_{59} & z_{60} & z_{61} & z_{62} & z_{63} & z_{64}\end{bmatrix} \cdot \begin{bmatrix}z_{16} & z_{1} & z_{2} & z_{3} & z_{4} & z_{5} & z_{6} & z_{7} & z_{8} & z_{9} & z_{10} & z_{11} & z_{12} & z_{13} & z_{14} & z_{15} \\z_{32} & z_{17} & z_{18} & z_{19} & z_{20} & z_{21} & z_{22} & z_{23} & z_{24} & z_{25} & z_{26} & z_{27} & z_{28} & z_{29} & z_{30} & z_{31} \\z_{48} & z_{33} & z_{34} & z_{35} & z_{36} & z_{37} & z_{38} & z_{39} & z_{40} & z_{41} & z_{42} & z_{43} & z_{44} & z_{45} & z_{46} & z_{47} \\z_{64} & z_{49} & z_{50} & z_{51} & z_{52} & z_{53} & z_{54} & z_{55} & z_{56} & z_{57} & z_{58} & z_{59} & z_{60} & z_{61} & z_{62} & z_{63}\end{bmatrix}$

[0093] After all 64 column DFTs are computed by the array in FIG. 2B,the rows would be distributed over the input PEs X in FIG. 2A whichcorrespond to the output PEs Z in FIG. 2B. The row DFTs could then beperformed with another similar set of rotations so that the final 2-Doutput is in the same radix-4 order as the original input. The rotationsoccur such that at the end of a row or column DFT, the matrices havereturned to their original positions. Also, the twiddle factors in thematrix WM are always in the correct positions for both row and columnDFT processing. Note that CM1, CM2, and WM are all matrices, whoseelements are fixed for a given transform length N.

[0094] The entire sequence of column and row DFTs is best understoodfrom viewing the systolic computation in the space-time domain as shownin FIG. 5. (Only two of the 64 row and column DFTs are shown.) It can beseen that successive row/column iterations can be “stacked” on top ofeach other in time so that all PEs are active all of the time and thusthat the design is throughput optimal with the exception of the smallperiod of time period at the end of the column DFTs when systolicprocessing switches from the design of FIG. 2B to that of FIG. 2A.Control of the PEs can be done in a number of ways, as for example usinga set of modulo counters in each PE to keep track of that PE's requiredfunctionality or propagating control signals along with the data.

[0095] A comparison of prior art 2-D systolic array designs forcomputing the 2-D DFT TABLE 4 Comparison of area and timecharacteristics or base-4 and 2-D systolic implementations. Here M and Acorrespond to the number of complex multipliers and adders. 2-D ArrayBase-4 Array M N² N/4 A N² 2N CPD 2N + 1 N²/2 + 5N/2 + 8 L 4N N²/2 +9N/4 + 4 ˜area × throughput = (M + A/4) 5N³/2 + 5N²/4 3N³/8 + 15N²/8 +6N CPD

[0096] and the base-4 design described herein is shown in Table 4.

[0097] The base-4 design is very different in that it is pseudo linearbecause as the transform size increases, it's width remains the same,while its length increases so that it uses O(N) hardware rather thanO(N²) when comparing non-partitioned designs of each. However, whenconsidering throughput and area together, the base-4 design is moreefficient by a factor of 6 for N=64, as indicated in Table 4, where itwas assumed that the area of an adder was ¼ that of a multiplier. Itshould be noted that the base-4 design can be used to efficientlyperform a 1-D DFT by factoring it into two sets of size n₁ and n₂ suchthat N=n₁n₂ in the same way as a prior art 2-D systolic array andperforming row and column DFTs on n₁ and n₂.

[0098] The particular implementation of the array hardware from thearray designs discussed above is relatively straightforward but isbeyond the scope of the invention and depends upon the specificstructures of the available hardware, numerous circuit designer choices,the I/O requirements, suitable simulation testing and necessary circuitrevisions, and so forth. FIG. 6-FIG. 10 is provided to show a conceptualhardware implementation of a systolic array as discussed above in accordwith the present invention. However, it will be understood that thesefigures are provided only as a generalized or rough example to aid inunderstanding the invention that provides improved structural concepts,organization, and features of systolic arrays for computing DFT and arenot intended to be limiting of the invention. The figures are also notintended to provide a working example which would accordingly requiremore specific hardware details, hardware design choices, as well assimulation testing, and so forth that is beyond the scope of thisdiscussion. Suitable hardware for systolic array hardware implementationpresently includes FPGAs, which are constructed from tiling identicalmemory and logic blocks along with supporting mesh interconnectionnetworks, in a way that matches the designed systolic array. This typeof hardware allows dynamic implementation of systolic designs leading toinexpensive “programmable” systolic array hardware. Use of applicationspecific integrated circuits (ASICS) provides for more efficient andfaster implementations, but ones that are not programmable.

[0099] Accordingly, in FIG. 6, a view of systolic array hardwareimplementation 100 for computing the DFT is provided to show theconceptual construction of a hardware implementation similar in designto that of the one-dimension pipelined systolic arrays of FIG. 2A andFIG. 2B but wherein the DFT transform length is thirty-two rather thansixty-four. (The design in FIG. 6 is conceptually a composite of the twodesigns in FIGS. 2A and 2B, each rotated 180 degrees.) It will beappreciated that, in accord with the present invention, the arraydesigns are easily scalable so that more or fewer components can beadded as desired depending on the DFT transform length to be calculated.As can be seen in table 3, for a DFT transform length of thirty-two,eight fixed point complex multiplier processing elements 102 areutilized, and sixty-four fixed point adders are utilized as found inadder array sections 107 and 112. Complex multiplier processing elementsform a linear array having a length equal to N/4, e.g., the transformlength divided by 4. In the circuit of FIG. 6, where the transform has alength of 32, N/4 will provide a linear array of eight multipliercircuits. As discussed above, the complex multiplier processing elements102 are utilized for calculating Y from equation (6) internally withinsystolic array 100.

[0100] Array 107 and 112 of complex adders each comprise two lineararrays organized in a 4×N/4 format of complex adder processing elements.In other words, each array 107 and 112 has a width of four complex adderprocessing elements. Each array 107 and 112 has a length of N/4 complexadder processing elements or eight for this circuit eight complex adderprocessing elements.

[0101] It will be seen in FIG. 6 that the systolic array is comprised ofa virtual mesh of North/South and East/West interconnections 103 betweenthe processing elements. For convenience in understanding, the symbolicinterconnections 103 have arrows which show the direction of data flowbetween the processing elements.

[0102] External to the systolic arrays of processors are memorystructures 130 that contain inputs X (or outputs Z when doing a 2-DDFT). Memory structures internal to adder sub array 112 may be utilizedthat contain outputs Z (when doing a 1-D DFT). In a preferredembodiment, each PE of the linear array of complex multipliers 102 iscapable of storing N/4 complex data values from the matrix W.

[0103] Control structures such as control 128 provide clocks that may beutilized to ensure synchronous operation of the processing elements.Internal circuits within each type of PE may be utilized tosynchronously control receipt of inputs, computation of results, anddistribution of results to adjacent PEs. The control structures are alsoutilized to reset all PEs such that they are ready to compute anotherDFT.

[0104] If the circuit of the type shown in FIG. 6 is utilized to providea 2-D DFT as discussed hereinbefore, additional circuitry may beutilized to provide matrix shifts or rotations as heretofore described.In this way, the same array, with the hardware for each array arrangedgenerally in the same manner shown in FIG. 6, could be utilized forprocessing the 1-D column and row DFTs, which when combined wouldprovide a 2-D DFT.

[0105]FIG. 7 provides an enlarged view of the one possible conceptualconfiguration for complex multiplier processing element 102. Withincomplex multiplier processing elements 102, matrix W_(M) is insertedand/or stored in FIFO element 104. The product C_(M1)X is applied toinput 106 from adder array section 107 whereupon complex multiplicationof W_(M) times C_(M1)X takes place in 108. The results may be stored andadded in pipe register 105 and adder 109 and then passed on at output110 to adder array section 112 to effect the product C_(M2)Y asdescribed hereinbefore which involves only addition due to theconstruction of matrices C_(M1) and C_(M2). When doing a 2-D DFT, the1-D DFTs that have been performed can be transferred from the internalregisters in 112 to memory elements 114. The these results can beretrieved from 114 to do the 1-D row DFTs.

[0106]FIG. 8 and FIG. 9 provides enlarged views of possibleconceptualized embodiments for complex adder processing elements 115 and116 which may be utilized in adder array section 107 and adder arraysecond 112, respectively. Complex adder processing element 115 isutilized to multiply (which requires only addition as discussed above)C_(M1)X. Elements 118 may control synchronization, operation, inputs andoutputs of processing element 115 to thereby receive C_(M1) and Xwhereupon the sum is determined element 120, the result of which istransferred to output 122 in accord with the flow direction ofinformation as indicated in FIG. 2B and FIG. 6. Likewise, in complexfixed point adder processing elements 116, control of operations may beperformed by control element 124. C_(M2) and/or Z may be placed instorage element 126. Y is received from the multiplier array. It will beunderstood that the individual processing elements such as adders andmultipliers may vary in construction as desired. However, in accord withsystolic array concepts, the architecture of the individual processingelements will be duplicated. Therefore, in this example, all adders inadder array section 107 are identical. Likewise the adders in arraysection 112 for this embodiment are identical but may not exactlyidentical with those of array section 107. Moreover the multipliers 102,which form a linear array, are the identical to each other. The use ofthe repeated structures in systolic arrays is a significant cost benefitof such devices.

[0107] Control and memory elements, such as control elements 128 andmemory elements 130 may be utilized to input data such as time domaindata X into systolic array hardware implementation 100 to perform theDFT thereon. One possible conceptual implementation for control element128 is shown in FIG. 10. Element 132 is utilized to interface withexternal instrumentation that provides the time domain data X or, inother implementations, could provide frequency domain data. Controllers134 and 136, as well as register 138, operate to input the time domaindata into memory elements 130 whereby the data may be inserted intoadder array section 107 and array activity such as add and multiplyfunctions are coordinated.

[0108] In summary, it will be seen that numerous different systolicarrays can be derived from or generated utilizing the present methodthat is based on arrays derivable from equation (6), including onedimensional as well as two dimensional arrays and/or multi-dimensionalarrays. The generated arrays can be selected to provide a veryeconomical but efficient systolic array structure implementation asindicated FIG. 2A or 2B and FIG. 6. To compute larger DFT transformlength, it is only necessary to scale the array.

[0109] For instance, the conceptualized circuit of FIG. 6 is scaled fora DFT transform length of thirty-two but by increasing theone-dimensional pipeline systolic array configuration, it is easilyaltered to provide for a DFT transform length of 64 as shown in FIG. 2Aor FIG. 2B. The number of multipliers is greatly reduced by a factor ofat least four as compared to the prior art systolic devices and thenumber of adders is only increased only by a factor of two. Thus, thenumber of multiplications is greatly reduced. This compares veryfavorably with prior art non-systolic designs because the transformlength, in presently preferred embodiments, need only be an integermultiple of sixteen whereas in the prior art non-systolic radix-4designs must satisfy N=4^(m) where N is the transform length and m is aninteger. However, the present invention is not intended to be limited tothese transform lengths except in the particular embodiments thereof asdescribed in the accompanying claims which specifically include thislimitation. There are no prior art radix-4 systolic designs. Radix-4designs in the past have only been associated with non-systolic specialpurpose “pipelined” FFTs. Moreover, equation (6) can also be utilized toconstruct hardware that very quickly computes the DFT, regardless oftransform size, as indicated in FIG. 4. The arrays can also be designedto reduce I/O requirements and/or other requirements that may apply tospecific computation needs of a particular instrument.

[0110] Therefore, it will be understood that many additional changes inthe details, materials, steps and arrangement of parts, which have beenherein described and illustrated in order to explain the nature of theinvention, may be made by those skilled in the art within the principleand scope of the invention as expressed in the appended claims.

What is claimed is:
 1. A method for providing a digital processorarchitecture for computing a discrete Fourier transform, said discreteFourier transform wherein one or more matrices Z of frequency domainvalues are equal to one or matrices X of time domain values times one ormore coefficient matrices, said method comprising: providing that saidone or more matrices Z of frequency domain values is determinable by aplurality of matrices multiplied together wherein at least two of saidplurality of matrices have matrix products which involve only complexnumber additions; generating one or more systolic array designs whereinsaid one or more systolic arrays designs each comprise a plurality ofprocessing elements, each of said plurality of processing elements beingadjacent one or more neighboring processing elements, each of saidplurality of processing elements being electrically interconnected to atleast one of said one or more neighboring processing elements, saidplurality of processing elements comprising one or more arrays ofcomplex adder processors for determining said matrix products whichinvolve said only said complex number additions.
 2. The method of claim1, wherein said one or more systolic array designs have a structurebased on equations of the form Y=W ^(t) _(M) ·C _(M1) X Z=C _(M2) Y ^(t)wherein Y, Z, W^(t) _(M), C_(M1), C_(M2), and X comprise one or morematrices, and wherein the matrix products C_(M1)X and C_(M2)Y^(t)involve only complex number additions or at least primarily complexnumber additions, organizing said one or more arrays of complex adderprocessors being for determining said matrix products which involve onlyor primarily said complex number additions of matrix products C_(M1)Xand C_(M2)Y^(t), providing that said plurality of processing elementscomprises at least one matrix product array operable for performingcomplex number multiplication, and organizing said at least one matrixproduct array for determining a matrix product of W^(t) _(M) andC_(M1)X.
 3. The method of claim 1, wherein said one or more arraydesigns are compatible for use with any transform length which is aninteger multiple of
 16. 4. The method of claim 1, further comprisingorganizing said plurality of processing elements into regular blocks ofsub matrices.
 5. The method of claim 4, wherein said one or more arraydesigns are scaleable for to different transform lengths by varying anumber of said regular blocks of sub matrices.
 6. The method of claim 5,wherein said one or more array designs comprises a pipelined lineararray which is scaleable so that said number of said regular blocks ofsub matrices are added only to one length dimension of said pipelinedlinear array to permit processing of an increased transform length. 7.The method of claim 1, wherein at least one matrix of said plurality ofmatrices contains elements that are found only in the set {i, −i, 1, −1}such that matrix multiplication with said at least one matrix requiresno multiplication.
 8. The method of claim 1, wherein said complex adderprocessing elements are not required to both add and multiply.
 9. Themethod of claim 1, wherein said digital hardware discrete Fouriertransform processor is operable for performing a two-dimensionaldiscrete Fourier transform.
 10. The method of claim 1, wherein saidplurality of processing elements required include a maximum number ofmultiplier processing elements that is equal or less than a transformlength divided by four.
 11. A digital processor architecture forcomputing a discrete Fourier transform, said discrete Fourier transformhaving a transform length N, where N is an integer multiple of sixteen,said processor comprising: at least one systolic array formed from aplurality of processing elements, said plurality of processing elementscomprising a plurality of complex multiplier processing elements and aplurality of complex adder processing elements, said plurality ofcomplex multiplier processing elements in said at least one systolicarray being limited in number to no more than N/4.
 12. The digitalprocessor architecture of claim 11, wherein said plurality of complexadder processing elements are organized into one or more sub arrays suchthat said one or more sub arrays have a first dimension comprising fourcomplex adder processing elements and a second dimension comprising N/4complex adder processing elements.
 13. The digital processorarchitecture of claim 11, wherein said plurality of complex adderprocessing elements are not operable for multiplying functions otherthan unity multiplication.
 14. The digital processor architecture ofclaim 11, further comprising: said at least one systolic array isoperable to perform a discrete Fourier transform having a transformlength of n1 and also operable to perform a discrete Fourier transformof length n2, whereby N=n1 *n2.
 15. The digital processor architectureof claim 11, wherein a structure of said one or more arrays has astructure based on equations of the form: Y=W ^(t) _(M) ·C _(M1) X Z=C_(M2) Y ^(t) wherein Y, Z, W^(t) _(M), C_(M1), C_(M2), and X compriseone or more matrices, and the matrix products C_(M1)X and C_(M2)Y^(t)involve only complex number additions.
 16. The digital processorarchitecture of claim 11, wherein said plurality of complex adderprocessing elements in said at least one systolic array is limited innumber to no more than 2N.
 17. A digital processor architecture forcomputing a discrete Fourier transform, said discrete Fourier transformwherein one or more matrices Z comprise frequency domain values and isequal to one or more matrices X of time domain values times one or morecoefficient matrices, said digital processor architecture comprising: atleast one systolic array formed from a plurality of processing elementsfor said at least one systolic array, said plurality of processingelements comprising a plurality of complex multiplier processingelements and a plurality of complex adder processing elements; and saidat least one systolic array having a structure based on equations of theform: Y=W ^(t) _(M) ·C _(M1) X Z=C _(M2) Y ^(t)  wherein Y, W^(t) _(M),C_(M1), and C_(M2) comprise one or more matrices and the matrix productsC_(M1)X and C_(M2)Y^(t) involve only or primarily complex numberadditions.
 18. The digital processor architecture of claim 17, whereinC_(M1) and C_(M2) consists only of elements defined by the set {i, −i,1, −1}.
 19. The digital processor architecture of claim 17, whereinC_(M1) and C_(M2) consists only of elements defined by the set{1/sqrt(2)*(i+1), 1/sqrt(2)*(−i+1), 1/sqrt(2)*(i−1), 1/sqrt(2)*(−i−1),i, −i, 1, −1}.
 20. The digital processor architecture of claim 17,further comprising memory structures external to said at least onesystolic array that contain input values for X and coefficients C_(M2).21. The digital processor architecture of claim 17, wherein saidplurality of complex multiplier processing elements comprise memorystructures capable of storing data values from matrix W^(t) _(M). 22.The digital processor architecture of claim 17, wherein said pluralityof complex multiplier processing elements are organized into aone-dimensional array having a length of N/4 complex processing elementswhere N is a transform length of said discrete Fourier transform. 23.The digital processor architecture of claim 17, wherein said pluralityof complex adder processing elements are organized into one or more subarrays of said at least one systolic array such that said one or moresub arrays has a first dimension comprising four complex adderprocessing elements and a second dimension comprising N/4 complex adderprocessing elements.
 24. The digital processor architecture of claim 17,wherein said C_(M1)X and C_(M2)Y^(t) involve only complex numberadditions.
 25. The digital processor architecture of claim 17, furthercomprising memory structures external to said at least one systolicarray that contain output values for Z and input coefficients C_(M1).26. A digital processor architecture for computing a discrete Fouriertransform, said discrete Fourier transform having a transform length N,said processor comprising: at least one first systolic array comprisinga first plurality of processing elements, said at least one systolicarray being organized into one or more first subarrays, each firstsubarray having a number of processing elements equal to N divided by abase.
 27. The digital processor architecture of claim 26, furthercomprising at least one second systolic array comprising a secondplurality of processing elements, said at least one systolic array beingorganized into one or more second subarrays, each second subarray havinga number of processors equal to the product of N divided by a base btimes b and wherein N is an integer multiple of b² and said base beingequal to a power of two.
 28. The digital processor architecture of claim26, further comprising said at least one first systolic array beingoperable to perform a discrete Fourier transform having a transformlength of n1 and also being operable to perform a discrete Fouriertransform of length n2, whereby N=n1*n2.
 29. The digital processorarchitecture of claim 28, further comprising shifting circuitry operableto provide shifts or rotations of matrix elements to support on-the-flypermutations of data for use in performing 2D discreet Fouriertransforms.
 30. A digital processor architecture for computing adiscrete Fourier transform, said discrete Fourier transform having atransform length N, wherein N is an integer, said processor comprising:at least one systolic array comprising a first plurality of processingelements; at least one multiplier subarray forming a portion of said atleast one systolic array, said at least one multiplier array consistingof N divided by a base b linearly aligned multiplier systolic elementswhereby N must be divisible by b²; and at least two complex addersubarrays, each of said at least two complex adder arrays with a firstdimension consisting of N/b adder systolic elements, and a seconddimension consisting of b adder systolic elements.
 31. The digitalprocessor architecture of claim 30, wherein b=4 and N is an integermultiple of
 16. 32. The digital processor architecture of claim 30,wherein said adder systolic elements comprise complex adder elements andsaid multiplier systolic elements comprise complex multiplier elements.33. A digital processor architecture for computing a discrete Fouriertransform, said discrete Fourier transform having a transform length N,wherein N is an integer, said processor comprising: at least onesystolic array comprising a first plurality of processing elements; atleast one multiplier subarray forming a portion of said at least onesystolic array, said at least one multiplier array having a firstdimension consisting of N/b multiplier systolic elements whereby b is abase and N must be divisible by b²; said at least one multiplier arrayhaving a second dimension consisting of N/b multiplier systolicelements; and at least one adder subarray forming a portion of said atleast one systolic array, said at least one adder array comprisingsystolic elements which are at least primarily adder systolic elements,said at least one adder subarray consisting of an order of N times anorder of N systolic elements.
 34. The digital processor architecture ofclaim 33, wherein b=4 and N is an integer multiple of
 16. 35. Thedigital processor architecture of claim 33, wherein said adder systolicelements comprise complex adder elements and said multiplier systolicelements comprise complex multiplier elements.
 36. The digital processorarchitecture of claim 33, wherein said at least one adder subarray is insurrounding relationship to said at least one multiplier array.
 37. Amethod for designing a digital processor architecture for computing adiscrete Fourier transform wherein one or more matrices Z comprisefrequency domain values and is equal to one or more matrices X of timedomain values times one or more coefficient matrices C such that Z=CX,where C is a coefficient matrix containing elements W_(N)^(kn)=e^(−2*j*π*(n−1)*(k−1)/N), said method comprising: providing atleast one systolic array formed from a plurality of processing elements,said plurality of processing elements comprising a plurality of complexmultiplier processing elements and a plurality of complex adderprocessing elements; transforming C into a desired base-b format byproviding a permutation matrix P that reorders X and Z according to${X_{b} = {{P\begin{bmatrix}X_{1} \\X_{2} \\X_{3} \\X_{4} \\X_{5} \\\vdots \\X_{13} \\X_{14} \\X_{15} \\X_{N}\end{bmatrix}} = \begin{bmatrix}X_{1} \\X_{1 + {N/4}} \\X_{1 + {N/2}} \\X_{1 + {3{N/4}}} \\X_{2} \\\vdots \\X_{N/b} \\X_{2{N/b}} \\X_{3{N/b}} \\X_{N}\end{bmatrix}}},{{{and}\quad Z_{b}} = {P\quad {Z\quad.}}}$

 whereby C can be transformed into C_(b)=PCP^(t), so thatZ_(b)=C_(b)X_(b); said at least one systolic array having anarchitectural structure based on an equation of the form C_(b)=PCP^(t)where b is said desired base-b format.
 38. The method of claim 37wherein b=4.